ABSTRACT

Over the last few years, dictionary learning has been extensively applied to various computer vision recognition tasks and has produced state-of-the-art results. However, when the data instances of a target domain have a different distribution from those of a source domain, dictionary learning may fail to perform well. In this paper, we address the cross-domain visual recognition problem and propose a simple but effective unsupervised domain adaption approach, in which labeled data come only from the source domain. To bring the original data in the source and target domains into the same distribution, the proposed method forces nearest coupled data between the source and target domains to have identical sparse representations while jointly learning a dictionary for each domain, such that the learned dictionaries can reconstruct the original data in the source and target domains respectively. The sparse representations of the original data can then be used to perform visual recognition tasks. We demonstrate the effectiveness of our approach on standard datasets. Our method performs on par with or better than competitive state-of-the-art methods.

Index Terms — dictionary learning, cross-domain, domain adaption, visual recognition

1. INTRODUCTION 

In the past decade, machine learning has been widely used for various computer vision applications, such as object recognition [1], multimedia retrieval, and image classification. Traditional machine learning methods often
learn a model from the training data and then apply it to the testing data. The fundamental assumption here is that the training data and testing data have the same distribution. However, in real-world applications, it cannot always be guaranteed that the training data share the same distribution as the testing data. When the two distributions differ, the results may be very poor, since the trained model is no longer optimal on the testing data. Examples include applying an image classifier trained on the Amazon dataset to phone photos taken in real life, or using a face recognition model trained on frontal, well-illuminated images to recognize non-frontal, poorly illuminated faces. This is often viewed as the visual domain adaption problem, which has attracted increasing interest.

Domain adaption aims at learning an adaptive classifier by utilizing information from both a source domain, which has plenty of labeled data, and a target domain, whose data are collected from a different distribution. Generally, domain adaption can be divided into two settings depending on the availability of labels in the target domain: semi-supervised domain adaption and unsupervised domain adaption. In semi-supervised domain adaption, labeled data are available in both the source domain (plenty of labeled data) and the target domain (a few labeled data), while in unsupervised domain adaptation labeled data are available only from the source domain. In this paper, we focus on unsupervised domain adaptation, which is a more challenging task and more in line with real-world applications.

Many recent works [6][7][8] focus on subspace-based methods to tackle visual domain adaption problems. In [8], Li et al. determined a feature subspace via canonical correlation analysis (CCA) [9] for recognizing faces with different poses. In [6], Gopalan et al. used geodesic flows to generate intermediate subspaces along the geodesic path between the source domain subspace and the target domain subspace on the Grassmann manifold. In [7], Gong et al. proposed the Geodesic Flow Kernel (GFK), which computes a symmetric kernel between source and target points based on the geodesic flow along a latent manifold.

In the last few years, dictionary learning based sparse representation has received extensive attention. It has been successfully used for a variety of computer vision applications, for example classification [10], recognition [11], and denoising [12]. Using an over-complete dictionary, a signal or image can be approximated by a combination of only a few atoms chosen from the learned dictionary. One of the early dictionary learning algorithms was proposed by Olshausen and Field [13], where a maximum likelihood (ML) learning method was used to sparsely encode images over a redundant dictionary. Based on the same ML objective function as in [13], Engan et al. [14] developed a more efficient algorithm, called the method of optimal directions (MOD), in which a closed-form solution for the dictionary update was proposed. More recently, in [15], Aharon, Elad and Bruckstein proposed the K-SVD algorithm, which generalizes k-means clustering and efficiently learns an over-complete dictionary from a set of training signals. This method has been applied to a variety of image processing problems.

Fig. 1. The overall schema of the proposed framework.

Most existing dictionary-based methods assume that training data and testing data come from the same distribution. However, the learned dictionary may not be optimal if the testing data have a different distribution from the data used for training. Learning dictionaries across different domains is a challenging task and has gradually become an active research topic over the last few years. In [16], Jia et al. considered a special case where corresponding samples from each domain were available, and learned a dictionary for each domain. Qiu et al. [17] presented a general joint optimization function that transforms a dictionary learned from one domain to the other, and applied this framework to applications such as pose alignment, pose and illumination estimation, and face recognition. Zheng et al. [18] proposed a method that achieved promising results on the cross-view action recognition problem, with pairwise dictionaries constructed using correspondences between the target view and the source view. In [19], Shekhar et al. learned a latent dictionary that can succinctly represent both domains in a common projected low-dimensional space. Ni et al. [20] learned a set of subspaces through dictionary learning to mitigate the divergence between the source and target domains. Huang and Wang [21] proposed a joint model that learns a pair of dictionaries with a feature space for describing and associating cross-domain data. In [22][23], Zhu and Shao proposed a weakly-supervised framework that learns pairwise dictionaries and a classifier while considering the capacity of the dictionaries in terms of reconstructability, discriminability and domain adaptability.

In this paper, we present an unsupervised domain adaption approach through dictionary learning. Different from the above dictionary learning based domain adaption methods, our method directly learns adaptive dictionaries in the low-level feature space and needs no labels in either the source domain or the target domain during the dictionary learning process. Our method is inspired by [22][23], which force similar samples of the same class to have identical representations in the sparse space. However, since our method is unsupervised, we instead assume that the nearest coupled low-level features in the original space should maintain their relationship in the sparse space (i.e. these coupled features have the same sparse representation). Following this idea, we construct a transformation matrix that selects, for each target sample, the nearest sample in the source domain. The dictionaries for the two domains are then jointly learned from these selected source data and the target data. The data from each domain can be encoded by the corresponding dictionary and represented by sparse features. An SVM classifier is then trained on these sparse features and used to predict the labels of the test data. The learning framework is built on a classic and efficient dictionary learning method, K-SVD [15]. We demonstrate the effectiveness of our approach on standard cross-domain datasets and obtain state-of-the-art results. An overall schema of the proposed framework is shown in Fig. 1.

2. PROPOSED METHOD 

2.1. Problem Notation 

Let $I_s = \{I_{s,i}\}_{i=1}^{N_s}$ and $I_t = \{I_{t,j}\}_{j=1}^{N_t}$ be the data instances from the source and target domains respectively, where $N_s$ and $N_t$ denote the numbers of samples. Each sample from $I_s$ and $I_t$ has a set of $d$-dimensional local features, so each sample can be represented by $I_{s,i} = \{y_{s,i}^{1}, y_{s,i}^{2}, \ldots, y_{s,i}^{M_i}\}$ and $I_{t,j} = \{y_{t,j}^{1}, y_{t,j}^{2}, \ldots, y_{t,j}^{M_j}\}$ in the source and target domains respectively, where $M_i$ and $M_j$ denote the numbers of local features. The sets of local features of the source and target domains can then be denoted as $Y_s \in \mathbb{R}^{d \times L_s}$ and $Y_t \in \mathbb{R}^{d \times L_t}$ respectively, where $L_s$ and $L_t$ denote the numbers of local features in the source and target domains.
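As a minimal sketch of this notation, the per-image local features can be concatenated column-wise into a single $d \times L$ matrix. The helper name `stack_local_features` and the `owner` index array (which remembers the image each column came from) are illustrative assumptions and not part of the method description; random data stands in for real descriptors.

```python
import numpy as np

def stack_local_features(images):
    """Concatenate per-image feature matrices (each d x M_i) into one d x L matrix.

    Also returns, for every column of the result, the index of the image
    that the column was taken from.
    """
    Y = np.hstack(images)
    owner = np.concatenate(
        [np.full(feats.shape[1], i) for i, feats in enumerate(images)]
    )
    return Y, owner

rng = np.random.default_rng(0)
d = 64  # assumed descriptor length (standard SURF descriptors are 64-dimensional)
source_images = [rng.standard_normal((d, m)) for m in (5, 3, 7)]  # toy M_i values
Y_s, owner_s = stack_local_features(source_images)  # Y_s has L_s = 15 columns
```

Keeping the column-to-image map is convenient later, when the sparse codes of the local features have to be pooled back into one vector per image.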

2.2. Dictionary Learning 

Here, we give a brief review of the classical dictionary learning approach. Given a set of $d$-dimensional input signals $Y \in \mathbb{R}^{d \times L}$, where $L$ is the number of input signals, a $K$-atom dictionary of the signals $Y$, $D \in \mathbb{R}^{d \times K}$, can be learned by solving the following optimization problem:

$$\{D, X\} = \arg\min_{D, X} \|Y - DX\|_F^2 \quad \text{s.t. } \forall i,\ \|x_i\|_0 \le T_0 \tag{1}$$

where $D = [d_1, d_2, \ldots, d_K] \in \mathbb{R}^{d \times K}$ denotes the dictionary, $X = [x_1, x_2, \ldots, x_L] \in \mathbb{R}^{K \times L}$ denotes the sparse coefficients of $Y$ decomposed over $D$, and $T_0$ is the sparsity level that constrains the number of nonzero entries in each $x_i$.

The performance of a sparse representation depends strongly on the dictionary learning method. The K-SVD algorithm [15] is a highly effective dictionary learning method that focuses on minimizing the reconstruction error. In this paper, we solve our formulation of unsupervised domain adaption dictionary learning based on the K-SVD algorithm.
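To make the alternating structure of K-SVD concrete, the following is a minimal NumPy sketch, not the reference implementation. It assumes a plain orthogonal matching pursuit (OMP) coder for the sparse coding step and initializes the atoms from randomly chosen training signals; the function names `omp` and `ksvd` and the toy problem sizes are illustrative choices.

```python
import numpy as np

def omp(D, y, T0):
    """Greedy orthogonal matching pursuit: code y with at most T0 atoms of D."""
    residual, support = y.copy(), []
    x = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(T0):
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:  # no new atom would improve the fit
            break
        support.append(k)
        # Least-squares refit on all atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def ksvd(Y, K, T0, n_iter=10, seed=0):
    """Alternate sparse coding and atom-by-atom dictionary updates (needs K <= L)."""
    rng = np.random.default_rng(seed)
    # Initialise atoms with randomly chosen, normalised training signals.
    D = Y[:, rng.choice(Y.shape[1], K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        # Sparse coding step: at most T0 nonzeros per column.
        X = np.column_stack([omp(D, y, T0) for y in Y.T])
        # Dictionary update step: refit each atom on the signals that use it.
        for k in range(K):
            users = np.flatnonzero(X[k])
            if users.size == 0:
                continue
            X[k, users] = 0.0
            E = Y[:, users] - D @ X[:, users]  # error without atom k
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]            # best rank-1 fit gives the new atom
            X[k, users] = s[0] * Vt[0]   # and its coefficients
    return D, X

rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 200))   # toy data: d = 20, L = 200
D, X = ksvd(Y, K=30, T0=3, n_iter=5)
```

The rank-1 SVD update only touches coefficients that were already nonzero, so the sparsity constraint of Equation (1) is preserved by the dictionary update step.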

2.3. Unsupervised Domain Adaption Dictionary Learning

Now, consider a more general scenario in which we have data from two domains, the source domain $Y_s \in \mathbb{R}^{d \times L_s}$ and the target domain $Y_t \in \mathbb{R}^{d \times L_t}$. We wish to jointly learn the corresponding dictionaries for each domain. Formally, we want to minimize the following cost function:

$$\{D_s, D_t, X_s, X_t\} = \arg\min_{D_s, D_t, X_s, X_t} \|Y_s - D_s X_s\|_F^2 + \|Y_t - D_t X_t\|_F^2 \quad \text{s.t. } \forall i,\ [\|x_i^s\|_0, \|x_i^t\|_0] \le T_0 \tag{2}$$

In addition, in order to maintain the relationships of the original feature space, we assume that the nearest coupled low-level features in the original space should also be the nearest couples in the sparse space. The new cost function is then given by:

$$\{D_s, D_t, X_s, X_t\} = \arg\min_{D_s, D_t, X_s, X_t} \|Y_s - D_s X_s\|_F^2 + \|Y_t - D_t X_t\|_F^2 + C([X_s\ X_t]) \quad \text{s.t. } \forall i,\ [\|x_i^s\|_0, \|x_i^t\|_0] \le T_0 \tag{3}$$

where $D_s = [d_1^s, d_2^s, \ldots, d_K^s] \in \mathbb{R}^{d \times K}$ is the learned source domain dictionary, $X_s = [x_1^s, x_2^s, \ldots, x_{L_s}^s] \in \mathbb{R}^{K \times L_s}$ is the matrix of sparse coefficients of the source domain, $D_t = [d_1^t, d_2^t, \ldots, d_K^t] \in \mathbb{R}^{d \times K}$ is the learned target domain dictionary, and $X_t = [x_1^t, x_2^t, \ldots, x_{L_t}^t] \in \mathbb{R}^{K \times L_t}$ is the matrix of sparse coefficients of the target domain. The function $C(\cdot)$ is defined as the distance, in the new sparse space, between the original nearest couples; a small $C(\cdot)$ indicates that the data retain more of their relationships in the new sparse space. This idea is inspired by [22][23], where this function is designed to measure the distances between similar cross-domain instances of the same class. However, our method is fully unsupervised and operates directly on low-level features. Thus, the function $C([X_s\ X_t])$ is defined as:

$$C([X_s\ X_t]) = \|X_t - X_s P\|_F^2 \tag{4}$$


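where $P \in \mathbb{R}^{L_s \times L_t}$ is the transformation matrix that records the nearest couples between the original data in the source and target domains. $P$ can be represented by:

$$P = \begin{pmatrix} \Phi(y_1^s, y_1^t) & \cdots & \Phi(y_1^s, y_{L_t}^t) \\ \vdots & \ddots & \vdots \\ \Phi(y_{L_s}^s, y_1^t) & \cdots & \Phi(y_{L_s}^s, y_{L_t}^t) \end{pmatrix} \tag{5}$$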


where $\Phi(y_i^s, y_j^t)$ is the Gaussian distance between data in the original feature space:

$$\Phi(y_i^s, y_j^t) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\|y_i^s - y_j^t\|^2}{2}\right) \tag{6}$$

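Then, $P$ is binarized by setting the maximum entry in each column to 1 and all other entries to 0, so that each target feature is coupled with its nearest source feature:

$$P(i,j) = \begin{cases} 1 & \text{if } P(i,j) = \max(P(:,j)) \\ 0 & \text{otherwise} \end{cases} \tag{7}$$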
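The construction of the binary matrix $P$ can be sketched in NumPy as follows. Because a Gaussian of the Euclidean distance decreases monotonically with that distance, taking the column-wise maximum of the Gaussian values selects the same entry as taking the column-wise minimum of the squared distances, and the sketch uses the latter to avoid numerical underflow of the exponential. The function name `nearest_couple_matrix` is an illustrative assumption, and ties are broken by keeping the first minimum.

```python
import numpy as np

def nearest_couple_matrix(Y_s, Y_t):
    """Binary L_s x L_t matrix with a single 1 per column, marking the source
    feature that is nearest to each target feature."""
    # Squared Euclidean distances between all source/target column pairs.
    sq = ((Y_s ** 2).sum(axis=0)[:, None]
          + (Y_t ** 2).sum(axis=0)[None, :]
          - 2.0 * Y_s.T @ Y_t)
    # argmin of the distance == argmax of any Gaussian of that distance.
    nearest = np.argmin(sq, axis=0)
    P = np.zeros(sq.shape)
    P[nearest, np.arange(sq.shape[1])] = 1.0
    return P

rng = np.random.default_rng(0)
Y_s = rng.standard_normal((8, 6))
# Toy target features: slightly perturbed copies of source columns 2, 0 and 5.
Y_t = Y_s[:, [2, 0, 5]] + 0.01 * rng.standard_normal((8, 3))
P = nearest_couple_matrix(Y_s, Y_t)
selected = Y_s @ P  # column j is the source feature coupled with target feature j
```

Multiplying by $P$ on the right therefore acts as a column selector: `Y_s @ P` has one column per target feature.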


Thus, Equation (3) can be written as:

$$\{D_s, D_t, X_s, X_t\} = \arg\min_{D_s, D_t, X_s, X_t} \|Y_s - D_s X_s\|_F^2 + \|Y_t - D_t X_t\|_F^2 + \|X_t - X_s P\|_F^2 \quad \text{s.t. } \forall i,\ [\|x_i^s\|_0, \|x_i^t\|_0] \le T_0 \tag{8}$$

Assuming that $P$ leads to a perfect mapping across the sparse codes $X_t$ and $X_s$, so that each nearest couple has an identical representation after encoding, we have $\|X_t - X_s P\|_F^2 = 0$ and thus $X_t = X_s P$. Restricting the source reconstruction term to the source features selected by $P$, we can rewrite Equation (8) as:

$$\begin{aligned} \{D_s, D_t, X_s, X_t\} &= \arg\min_{D_s, D_t, X_s, X_t} \|(Y_s - D_s X_s)P\|_F^2 + \|Y_t - D_t X_t\|_F^2 \\ &= \arg\min_{D_s, D_t, X_s, X_t} \|Y_s P - D_s X_s P\|_F^2 + \|Y_t - D_t X_t\|_F^2 \\ &= \arg\min_{D_s, D_t, X_t} \|Y_s P - D_s X_t\|_F^2 + \|Y_t - D_t X_t\|_F^2 \\ &\quad \text{s.t. } \forall i,\ \|x_i^t\|_0 \le T_0 \end{aligned} \tag{9}$$

2.4. Optimization 

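We can write Equation (9) as:

$$\{\tilde{D}, \tilde{X}\} = \arg\min_{\tilde{D}, \tilde{X}} \|\tilde{Y} - \tilde{D}\tilde{X}\|_F^2 \quad \text{s.t. } \forall i,\ \|\tilde{x}_i\|_0 \le T_0 \tag{10}$$

where $\tilde{Y} = \begin{pmatrix} Y_s P \\ Y_t \end{pmatrix}$, $\tilde{D} = \begin{pmatrix} D_s \\ D_t \end{pmatrix}$, and $\tilde{X} = X_t$. This optimization problem has the same form as Equation (1) and can therefore be solved using the K-SVD algorithm [15].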
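The stacking step can be checked numerically. The sketch below uses random matrices with toy sizes (all values are assumptions for illustration) to verify that the single stacked reconstruction error equals the sum of the two per-domain errors of Equation (9), and shows how the two dictionaries are read back from the stacked one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, L_s, L_t = 6, 10, 12, 9            # toy sizes
Y_s = rng.standard_normal((d, L_s))
Y_t = rng.standard_normal((d, L_t))
D_s = rng.standard_normal((d, K))
D_t = rng.standard_normal((d, K))
X_t = rng.standard_normal((K, L_t))       # shared codes (dense here, for the check only)

# A random binary selector with exactly one 1 per column, standing in for P.
P = np.zeros((L_s, L_t))
P[rng.integers(0, L_s, L_t), np.arange(L_t)] = 1.0

Y_tilde = np.vstack([Y_s @ P, Y_t])       # (2d) x L_t stacked data
D_tilde = np.vstack([D_s, D_t])           # (2d) x K stacked dictionary

stacked = np.linalg.norm(Y_tilde - D_tilde @ X_t) ** 2
separate = (np.linalg.norm(Y_s @ P - D_s @ X_t) ** 2
            + np.linalg.norm(Y_t - D_t @ X_t) ** 2)

# After learning, the domain dictionaries are the top and bottom halves.
D_s_rec, D_t_rec = D_tilde[:d], D_tilde[d:]
```

One practical point, stated here as an observation rather than as part of the method: a standard K-SVD run normalizes the stacked atoms as a whole, so the two halves are not individually unit-norm.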








2.5. Object recognition 

Given the learned $D_s$ and $D_t$, we obtain sparse representations of the training data in the source domain and the testing data in the target domain respectively. For each image $i$, we obtain a set of sparse representations $X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,M_i}] \in \mathbb{R}^{K \times M_i}$, where $x_{i,j}$ is the sparse representation of the $j$-th feature in image $i$, $K$ denotes the dictionary size, and $M_i$ is the number of local features in image $i$. Each image is then represented by a $K$-dimensional global vector obtained by max pooling the sparse codes of its local features, and a linear SVM classifier is used for cross-domain recognition.
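The pooling step can be sketched as below. The text does not state whether the maximum is taken over signed coefficients or over their absolute values, so the `use_abs` flag is an assumption that exposes both variants; the helper name `max_pool_codes` and the `owner` array (image index of each code column) are likewise illustrative. Classifier training is not shown: the pooled rows would serve as the inputs to a linear SVM.

```python
import numpy as np

def max_pool_codes(X, owner, n_images, use_abs=False):
    """Pool sparse codes (K x L) into one K-dimensional vector per image.

    owner[j] is the index of the image that code column j belongs to.
    Images without any local feature get an all-zero vector.
    """
    V = np.abs(X) if use_abs else X
    pooled = np.zeros((n_images, X.shape[0]))
    for i in range(n_images):
        cols = V[:, owner == i]
        if cols.shape[1] > 0:
            pooled[i] = cols.max(axis=1)  # element-wise max over the image's codes
    return pooled

# Toy example: K = 2 atoms, three local features, two images.
X = np.array([[0.0, 2.0, 0.0],
              [1.0, 0.0, -3.0]])
owner = np.array([0, 0, 1])
pooled = max_pool_codes(X, owner, n_images=2)
pooled_abs = max_pool_codes(X, owner, n_images=2, use_abs=True)
```

With signed pooling the second image keeps the coefficient -3 because it has a single feature; with absolute-value pooling the same entry becomes 3.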

3. EXPERIMENTS 

In this section, we evaluate our domain adaption approach on 2D object recognition across different datasets.

Experimental Setup: Following the experimental setting in [7], we evaluate our domain adaption approach on four datasets: Amazon (images downloaded from online merchants), Webcam (low-resolution images taken by a web camera), Dslr (high-resolution images taken by an SLR camera), and Caltech-256 [24]. We regard each dataset as a domain. We extract 10 classes common to all four datasets: BACKPACK, TOURING-BIKE, CALCULATOR, HEADPHONES, COMPUTER-KEYBOARD, LAPTOP-101, COMPUTER-MONITOR, COMPUTER-MOUSE, COFFEE-MUG, and VIDEO-PROJECTOR. There are 2533 images in total, and each class has 8 to 151 images in a dataset. We use a SURF detector [25] to extract local features from all images. For each pair of source and target domains, we use 20 training samples per class when Amazon or Caltech is the source, and 8 samples per class when DSLR or Webcam is the source. To allow a complete comparison with existing domain adaption methods, we also carried out experiments in the semi-supervised setting, where we additionally sampled 3 labeled images per class from the target domain. We ran 20 different trials corresponding to different selections of labeled data from the source and target domains, and tested on all unlabeled data in the target domain. Our baseline is BOW, where all images are represented by 800-bin histograms over codebooks trained on a subset of Amazon images. Our method is also compared with Metric [26], SGF [6] and GFK [7]. Note that Metric [26] is limited to the semi-supervised setting.
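The random selection of labeled samples for each trial can be sketched as follows. The label array below is a toy stand-in (10 classes with 30 images each) rather than the real datasets, and the helper name `sample_per_class` is an illustrative assumption.

```python
import numpy as np

def sample_per_class(labels, n_per_class, rng):
    """Return sorted indices of n_per_class randomly chosen samples from each class."""
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Raises ValueError if a class has fewer than n_per_class samples.
        picked.append(rng.choice(idx, size=n_per_class, replace=False))
    return np.sort(np.concatenate(picked))

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 30)   # toy stand-in for one domain's labels
# 20 trials with 8 labeled source samples per class, as for a Dslr or Webcam source.
trials = [sample_per_class(labels, 8, rng) for _ in range(20)]
```

The same helper would cover the 20-per-class source case and the 3-per-class target samples of the semi-supervised setting by changing `n_per_class`.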

Parameter Settings: For our method, we set the dictionary size to $K = 512$ and the sparsity level to $T_0 = 5$ for each domain.

Results: The average recognition rates are reported in Fig. 2 and Fig. 3 for the unsupervised and semi-supervised settings respectively. The baseline BOW has the lowest recognition rate, and all domain adaptation methods improve accuracy over it. Furthermore, the GFK-based method [7] clearly outperforms Metric [26] and SGF [6]. Overall, our method consistently performs better than all other methods, except for one source and target pair, for which it is slightly below GFK [7] in the unsupervised setting.

[Figure: recognition accuracy for the source -> target pairs A->D, A->W, C->A, C->D, C->W, D->W, W->A, W->D.]

Fig. 2. Cross dataset object recognition accuracies on target domains with unsupervised adaptation over the four datasets (A: Amazon, C: Caltech, D: Dslr, W: Webcam).

[Figure: recognition accuracy for the source -> target pairs A->D, A->W, C->A, C->D, C->W, D->W, W->A, W->D. Legend: BOW, Metric, SGF, GFK, Ours.]

Fig. 3. Cross dataset object recognition accuracies on target domains with semi-supervised adaptation over the four datasets (A: Amazon, C: Caltech, D: Dslr, W: Webcam).

4. CONCLUSIONS 

In this paper, we presented a fully unsupervised domain adaption dictionary learning method that jointly learns domain dictionaries by capturing the relationship between the source and target domains in the original data space. We evaluated our method on publicly available datasets and obtained improved performance over the state of the art.